Project: Ensemble Techniques

By: Jacob Siegel

Load Libraries

Load Data Set

The original Excel file was converted to a CSV file so that it loads more easily.

Several columns need to be converted from the object dtype to categorical variables.
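A minimal sketch of this conversion, assuming pandas. The sample frame and the column names below are hypothetical stand-ins for the project data.

```python
import pandas as pd

# Hypothetical sample; the real data set has more columns and rows.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Occupation": ["Salaried", "Small Business", "Salaried"],
    "Age": [34, 41, 29],
})

# Columns assumed to hold category labels rather than free text.
cat_cols = ["Gender", "Occupation"]

# Convert each one to the pandas 'category' dtype.
for col in cat_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Listing the columns explicitly keeps the conversion predictable; numeric columns such as Age are left untouched.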

Check For Duplicates

Check for missing values

Several columns have missing values; 84.5 percent of the rows have no missing values. The missing values will be filled in below, and a new column will be added to record how many missing values each row contained. This column will be used to create an alternative model.
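The steps described above could look roughly like the sketch below. The column names, the helper column MissingCount, and the choice to fill non-numeric columns with the most frequent value are all assumptions made for illustration; the text only states that averages were used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both numeric and text columns.
df = pd.DataFrame({
    "Age": [34.0, np.nan, 29.0, 45.0],
    "MonthlyIncome": [20000.0, 25000.0, np.nan, np.nan],
    "TypeofContact": ["Self Enquiry", None, "Company Invited", "Self Enquiry"],
})

# Record how many values each row is missing before anything is filled in.
df["MissingCount"] = df.isna().sum(axis=1)

# Share of rows with no missing values at all.
complete_share = (df["MissingCount"] == 0).mean()

# Fill numeric columns with the column mean; for other columns the
# most frequent value is used (an assumption, not stated in the text).
for col in df.columns.drop("MissingCount"):
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(f"{complete_share:.1%} of rows had no missing values")
```

Counting the gaps before filling matters: once the values are imputed, the information about which rows were incomplete is gone.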

There are 920 people who took the product and 3968 who did not, so about 19 percent of the 4888 customers are in the positive class.

EDA and Data Clean Up

Univariate Analysis

Categorical Variables

The Gender category contains the typo 'Fe Male', which will be corrected to 'Female'.
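One way to apply this correction, sketched on a small hypothetical frame with a plain string column:

```python
import pandas as pd

# Hypothetical sample containing the mistyped label.
df = pd.DataFrame({"Gender": ["Male", "Fe Male", "Female", "Fe Male"]})

# Map the mistyped label onto the correct one; other values are unchanged.
df["Gender"] = df["Gender"].replace("Fe Male", "Female")

print(df["Gender"].value_counts())
```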

Continuous Variables

There are a few outliers in 'DurationofPitch' and 'MonthlyIncome' that will be removed.
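The text does not say how the outliers were identified. The sketch below uses the common 1.5 x IQR rule as an assumption; the helper name iqr_mask, the sample values, and the exact column spellings are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one extreme pitch duration.
df = pd.DataFrame({
    "DurationofPitch": [8, 10, 12, 14, 15, 16, 127],
    "MonthlyIncome": [18000, 20000, 21000, 22000, 23000, 25000, 24000],
})

def iqr_mask(series, k=1.5):
    """Return True for values within k * IQR of the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Keep only rows that are inside the fences for both columns.
keep = iqr_mask(df["DurationofPitch"]) & iqr_mask(df["MonthlyIncome"])
df_clean = df[keep].reset_index(drop=True)

print(len(df), "rows before,", len(df_clean), "rows after")
```

A fixed threshold chosen from the box plots would work equally well; the IQR rule is used here only because it needs no hand-picked cutoffs.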

Bivariate Analysis

Make a plot of monthly income vs. age and product taken.

Model Building

Expand the data set to include dummy variables, and create a second data set with all rows that previously contained missing values removed.
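A minimal sketch of building the two data sets, assuming pandas. The column names (including the target ProdTaken) and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sample with one missing Age value.
raw = pd.DataFrame({
    "Age": [34.0, np.nan, 29.0, 45.0],
    "Gender": ["Male", "Female", "Female", "Male"],
    "ProdTaken": [0, 1, 0, 1],
})

# Second data set: only rows that were complete in the raw data.
complete = raw.dropna().copy()

# First data set: missing values filled with the column average.
filled = raw.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# One-hot encode the categorical column; drop_first avoids a redundant column.
filled_dummies = pd.get_dummies(filled, columns=["Gender"], drop_first=True)
complete_dummies = pd.get_dummies(complete, columns=["Gender"], drop_first=True)

print(filled_dummies.shape, complete_dummies.shape)
```

Both frames end up with the same columns, so a model fitted on either one can be evaluated in the same way.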

The following function, modified from a class example, makes a confusion matrix.
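The class example itself is not reproduced here. As a rough sketch of what such a helper might do, the hypothetical function below builds the matrix with scikit-learn and prepares count-plus-percentage labels for each cell.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_with_percent(y_true, y_pred):
    """Return the confusion matrix and a label per cell with count and share."""
    cm = confusion_matrix(y_true, y_pred)
    pct = cm / cm.sum()
    labels = np.array([
        [f"{count} ({share:.1%})" for count, share in zip(row_counts, row_shares)]
        for row_counts, row_shares in zip(cm, pct)
    ])
    return cm, labels

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

cm, labels = confusion_with_percent(y_true, y_pred)
print(cm)
print(labels)
```

The label array can be handed to a heatmap as annotations, for example seaborn's heatmap with annot=labels and fmt="".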

The following function comes from a class example to measure the model statistics.
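A sketch of what such a metrics helper could look like, assuming scikit-learn. The function name model_scores and the synthetic data are hypothetical; they stand in for the class example and the project data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def model_scores(model, X, y):
    """Compute accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred, zero_division=0),
        "Precision": precision_score(y, pred, zero_division=0),
        "F1": f1_score(y, pred, zero_division=0),
    }

# Synthetic, imbalanced data standing in for the project data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

model = AdaBoostClassifier(random_state=1).fit(X_train, y_train)

# Side-by-side train and test scores make overfitting easy to spot.
scores = pd.DataFrame({
    "Train": model_scores(model, X_train, y_train),
    "Test": model_scores(model, X_test, y_test),
})
print(scores)
```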

AdaBoost Classifier

The precision on the test data is low.

Gradient Boosting Classifier

The gradient boosting classifier has improved precision.

XGBoost Classifier

XGBoost has the best precision of the default models.

XGBoost Classifier with missing data removed

One more model will be run on the data set with the rows that contained missing values removed. The previous model had missing values filled in with averages.

It appears that removing the rows that had missing values (as opposed to filling in the missing values with averages) gives a better fit to the data, so this data set will be used going forward with the models.

Hyperparameter Tuning
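As a rough illustration of the tuning step, the sketch below runs a small grid search with scikit-learn. The parameter grid, the choice of precision as the scoring metric, and the synthetic data are assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data standing in for the project data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# A deliberately small, hypothetical grid to keep the search fast.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    param_grid,
    scoring="precision",  # assumed metric of interest
    cv=3,
)
search.fit(X_train, y_train)

# The refitted best model can be scored like any other classifier.
tuned = search.best_estimator_
print(search.best_params_)
```

The same pattern applies to the gradient boosting and XGBoost classifiers by swapping the estimator and the grid.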

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier

The following workflow comes from a class example to compare all the model results.
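A comparison of this kind might be assembled as in the sketch below. Only two scikit-learn models are included to keep the example self-contained, and the data are synthetic; an XGBoost classifier or the tuned models could be added to the dictionary in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the project data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

# Fit each model and collect its test metrics in one row.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "Test Accuracy": accuracy_score(y_test, pred),
        "Test Recall": recall_score(y_test, pred, zero_division=0),
        "Test Precision": precision_score(y_test, pred, zero_division=0),
    })

comparison = (
    pd.DataFrame(rows)
    .set_index("Model")
    .sort_values("Test Accuracy", ascending=False)
)
print(comparison)
```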

XGBoost with default parameters has the best accuracy of all the models.